In this paper, we present BlinkDB, a massively parallel, sampling-based approximate query engine for running ad-hoc, interactive SQL queries on large volumes of data. The key insight that BlinkDB builds on is that one can often make reasonable decisions in the absence of perfect answers. For example, reliably detecting a malfunctioning server using a distributed collection of system logs does not require analyzing every request processed by the system. Based on this insight, BlinkDB allows one to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas that differentiate it from previous work in this area: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional, multi-resolution samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy and/or response time requirements. We have built an open-source version of BlinkDB and validated its effectiveness using the well-known TPC-H benchmark as well as a real-world analytic workload derived from Conviva Inc. Our experiments on a 100-node cluster show that BlinkDB can answer a wide range of queries from a real-world query trace on up to 17 TB of data in less than 2 seconds (over 100× faster than Hive), within an error of 2-10%.
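
The idea of answering a query from a sample and attaching an error bar, then choosing the smallest sample that meets an accuracy target, can be illustrated with a minimal sketch. This is not BlinkDB's implementation: it assumes uniform random samples (rather than the multi-dimensional samples described above), a simple AVG-style aggregate, and a normal-approximation 95% confidence interval. The function names `approximate_mean` and `pick_sample` and the synthetic data are hypothetical.

```python
import math
import random
import statistics


def approximate_mean(sample, z=1.96):
    """Estimate the mean from a sample.

    Returns (estimate, half_width), where half_width is the half-width of a
    normal-approximation confidence interval (z=1.96 is roughly 95%).
    """
    n = len(sample)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    return estimate, z * std_error


def pick_sample(samples, max_relative_error):
    """Try pre-built samples from smallest to largest and return the first
    answer whose relative error bound meets the target.

    Falls back to the largest sample if none meets it.
    Returns (estimate, half_width, sample_size).
    """
    ordered = sorted(samples, key=len)
    for sample in ordered:
        estimate, half_width = approximate_mean(sample)
        if estimate != 0 and half_width / abs(estimate) <= max_relative_error:
            return estimate, half_width, len(sample)
    estimate, half_width = approximate_mean(ordered[-1])
    return estimate, half_width, len(ordered[-1])


# Synthetic "table column" standing in for a large dataset (assumption).
rng = random.Random(0)
population = [rng.gauss(100.0, 20.0) for _ in range(200_000)]

# Samples at several resolutions, built ahead of query time.
samples = [rng.sample(population, n) for n in (100, 1_000, 10_000)]

estimate, half_width, used = pick_sample(samples, max_relative_error=0.02)
print(f"AVG ~ {estimate:.2f} +/- {half_width:.2f} using {used} rows")
```

In this sketch the 100-row sample yields an error bar that is too wide for a 2% target, so the 1,000-row sample is used, and the 10,000-row sample is never scanned. A response-time constraint could be handled the same way by capping the sample size instead of the error.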